bigdata_uclfandomcom-20200215-history
Clustering
'''Big Title'''= '''Part 1 : Slides''' '''Part 2 : Project''' Improvements of k-means algorithm and application in cartography and imaging. = '''Review of Clustering Techniques''' = = '''Point Assignement Clustering''' '''Clustering with k-means : Problem and Algorithm''' '''''K-means Problem''''' Given a set of points C = \{p_1, ..., p_n\} \subset R^m and an integer k , we want to find a set of points, called centroids, M = \{m_1, ..., m_k\} \subset R^m math that minimizes {\sum_{i=1}^N (\min_{1\leq k} d(p_i,c_j))^2} . A cluster L_i is then quite naturally defined by the set of points that are closest to m_i than from any other centroïd. We can show that this problem is NP-hard, we thus need approximations algorithms to solve it. We presents such algorithms in the next scetion '''''K-means Problem'''''. '''''K-means Algorithms''''' The best known k-means algorithm is Lloyds'algorithm. Given points to cluster P, the k-means problem can be expressed as the minimization of g(C,L) , where g(C,L) = \sum_{i=1}^k \sum_{p \in L_i} d(p,m_i)^2 * Initialize the centroids M^{(0)} * Iterate until M^{(i)} = M^{(i-1)} ** i \leftarrow i+1 ** Assign each point to its closest centroïd : L^{(i)} = \arg \min_{L} g_1^{(i)}(L) = g(C^{(i-1)},L) ** Update the position of the centroïd to the one that minimize the inter-cluster variance, i.e. make m_i the geometric center of the points in L_i : C^{(i)} = \arg \min_{C} g_2^{(i)}(C) = g(C,L^{(i)}) The Lloyd's algorithm is thus simply a coordinate by coordinate optimization method. It converges to local optimum (\bar{C},\bar{L}) in a finite number of steps, but is not guaranted to find a global optimum. Indeed, even if \bar{g}_1(\bar{L}) and \bar{g}_2(\bar{C}) are minimized, g(\bar{C},\bar{L}) may not be minimal. It is trivially the case if for example the is only one non empty centroid m_1 that is the geometrical center of all points, and all other centroïds are such that for every point p, d(m_1,p) < d(m_i,p) . However, it it obvious that placing one empty centroïd on any point p would decrease the cost. There are other cases where Lloyd's algorithm finds a local optimum of the k-means problem, even when no clusters are empty. The proof of the finiteness of the number of steps needed to reach convergence is quite simple: at each iteration, the cost decreses strictly (except at the last iteration, which is precisely allows to detect the convergence). As there is only a finite number of configurations (actually there are \binom{n}{k} possible configurations), the number of iteration must be finite. From a practical point of view, this bound on the number of steps required is not very interesting because it is exactly the complexity of the brute force algorithm, which guarantees to find a global minimum. However we can find in the literature some tighter bounds on the number of iterations required, and in practice, only a few (depending on n,k and the dimensionnality) iterations are needed to reach convergence. The implementation of Lloyd's algorithm in Python is given below. # -*- coding: utf-8 -*- import matplotlib.pyplot as plt import matplotlib.cm as cm from math import sqrt, pow, ceil import numpy as np import random from itertools import repeat import time #################### # Functions #################### def file_len(fname): # number of lines in file fname with open(fname) as f: for i, l in enumerate(f): pass return i + 1 def loadData(myFile): #myFile = "workfile" nLines = file_len(myFile) f = open(myFile, "r") reader = f.readlines() i=0 for line in reader: parts = line.split(' ') dim = len(parts) if i 0: P = np.ndarray(shape=(nLines,dim),dtype='float') for d in xrange(dim): P[i,d] = float(parts[d]) i+=1 f.close() return P, nLines, dim def computeDist(X,Y): return sqrt(computeDistSquare(X,Y)) def computeDistSquare(X,Y): dim = len(X) sumsq = 0 for d in xrange(dim): sumsq += pow(X[d]-Y[d],2) return sumsq def initializationFctRnd(P, nPoints, dim, k): rand_smpl = random.sample(xrange(nPoints), k) M = np.zeros((k,dim), dtype=np.float) for i in xrange(k): M[i,:] = P[rand_smpl[i],:] return M def Cost(P,M,mOfp): C = 0.0 for p in xrange(nPoints): C += computeDistSquare(P[p,:], M[mOfp[p],:]) return C def FindClosestCentroid(p,M): distToM = float('inf') mClosest = -1 for m in xrange(k): dist = computeDist(P[p,:], M[m,:]) if dist < distToM: distToM = dist mClosest = m return mClosest def ComputeCentroids(L,M): Mnew = np.zeros((k,dim),dtype=np.float); for m in xrange(k): if len(L[m]) 0: Mnew[m,:] = M[m,:] else: for p in L[m]: Mnew[m,:] +=P[p,:]/len(L[m]) return Mnew ########################## # One K-means iteration ########################## def SimpleIterate(M,mOfp): L = [[] for i in repeat(None, k)] nReallocations = 0 for p in xrange(nPoints): # Assign each point to its closest centroid mClosest = FindClosestCentroid(p,M) if mOfp[p] != mClosest: nReallocations+=1 mOfp[p] = mClosest L[mClosest].append(p) Mnew = ComputeCentroids(L,M) # Update centroid position return Mnew,mOfp,nReallocations ########################## # Load data & metaparameters ########################## fileName = "1000Points4clusters" global P,nPoints,dim,k,nIt, plotFig [P, nPoints, dim] = loadData(fileName) k = 4 nIt = 10 plotFig=1 nFig = 0 ############# # Main loop ############# start_time = time.time() M = initializationFctRnd(P, nPoints, dim, k) mOfp = -1*np.ones(nPoints, dtype=np.int) itWhile = 0 nReallocations = -1 while nReallocations != 0 and itWhile < 2: itWhile +=1 [M,mOfp,nReallocations] = SimpleIterate(M,mOfp) print '' print 'Number of points reallocated at iteration ' + str(itWhile) + ' : ' + str(nReallocations) C = Cost(P,M,mOfp) print 'Cost at iteration '+ str(itWhile) + ' : ' + str(C) colors = cm.rainbow(np.linspace(0, 1, k)) nFig+=1 fig = plt.figure(nFig) fig.suptitle('Partitionning the Dataset in ' + str(k) + ' Clusters') plt.xlabel('x') plt.ylabel('y') for p in xrange(nPoints): plt.plot(P[p,0],P[p,1],'o',color=colors[mOfp[p]]) for cluster in xrange(k): plt.plot(M[cluster,0],M[cluster,1],'s',markersize=20,color=colors[cluster]) plt.show() print("elapsed time [s] : " + str(time.time() - start_time)) # UNCOMMENT FOR VISUALIZING ONE FIGURE PER ITERATION # SLOWS DOWN THE ALGO SIGNIFICANTLY """ colors = cm.rainbow(np.linspace(0, 1, k)) nFig+=1 fig = plt.figure(nFig) fig.suptitle('Partitionning the Dataset in ' + str(k) + ' Clusters') plt.xlabel('x') plt.ylabel('y') for p in xrange(nPoints): plt.plot(P[p,0],P[p,1],'o',color=colors[mOfp[p]]) for cluster in xrange(k): plt.plot(M[cluster,0],M[cluster,1],'s',markersize=20,color=colors[cluster]) plt.show() print("elapsed time [s] : " + str(time.time() - start_time)) """ ######################### # Print the final result ######################### """ colors = cm.rainbow(np.linspace(0, 1, k)) nFig+=1 fig = plt.figure(nFig) fig.suptitle('Partitionning the Dataset in ' + str(k) + ' Clusters') plt.xlabel('x') plt.ylabel('y') for p in xrange(nPoints): plt.plot(P[p,0],P[p,1],'o',color=colors[mOfp[p]]) for cluster in xrange(k): plt.plot(M[cluster,0],M[cluster,1],'s',markersize=20,color=colors[cluster]) plt.show() """ If we require an Euclidian space, the complexity of each iteration is in O(k\cdot n) . Indeed the assignement step (minimization of g_1^{(i)}(L) requires computing for each point, the distance for that point to every centroid. The second step is where the assumption of an Euclidian space matters: optimizing the position of each centroïd's position is simply the geometric mean of the points assigned to that centroïd. This takes O(|L_i|) for centroïd m_i , and thus O(n) for all centroïds. If we do not require an Euclidian space, then the second step is not easy anymore: what would for example be the mean of a set of strings ? We could define it as the string that minimizes the sum of the square "edit distances". '''Improving k-means and Lloyd's Algorithm''' '''Spectral Clustering : a Powerful Tool''' '''A Hybrid Algorithm : Combination of Apectral Clustering and k-means''' We know that k-means fails at detecting interlacing non-convex shapes. Hierachical algorithms can be used, however, they are quite slow. On the other hand, spectral clustering works well, but to use it, we need a similarity matrix. The idea is to use k-means in order to build a similarity : we run k-means a large number of times with each tilme a different k. The similarty matrix S is defined by S_{ij} = \text{ the number of time } i \text{ and } j \text{ are in the same cluster } Then perform a spectral clustering based on that similarity matrix. The idea is that the notion of similarity between points propagates from close to close. Let us show on an example what happens: suppose we want to cluster thes points: Clustering the points with k-means with different k's give thus kind of result: we display here a fictive resuly of haw k-means could behave with a very small k (k=2, yellow), a bigger k (red) and a very big k (green, these ones are not all displayed, but you get the idea...). We see here that taking large clusters (thus a small k) is not really relevant because it groups together points that we dont want to be similar. We will thus use larges k's. The key point is that the clusters obtained with one given k=k1 are overlapping with other clusters obtained when k=k2, k3, ... This will allow to propagate the notion of similarity between the branches of each spiral. Finally, using the spectral clustering on the similarity matrix obtained gives the following (non fictive !) result, so we can see that it works really well: To summarize, the Hybrid algorithm consist in applying step y step the following points: * For k \in \{k_1..k_N\} \doteq nKmeans , run k-means and store the points assignement (idx) * Initialize a similarity matrix S = zeros(n,n) , where n is the number of points, and fill it : for clusters=1:length(nKmeans) for i=1:nKmeans(clusters) indices=find(idx(:,clusters) i); S(indices,indices)=S(indices,indices)+1; end end *Perform a spectral clustering on the similarity matrix S '''Applications''' '''''Images''''' Besides our theorical reasonning, it is important for us to give a more practical aspect to our project, we interest ourselves in some nice applications of clustering in image processing. In particular, we will see how we can compress an image without losing too much quality, how we can detect, isolate and count similar items on a picture. ''Image Compression'' Image compression is a subject that has already been broadly studied because of its interest for a large public (Many people are interested in stocking a large amout of images and movies on their computers, smartphones, ...). We distinguish two kinds of image compression : lossy or not. While lossy compression may be prohibited in some application, like for example medical imaging, where every detail may count, however, unlossy compression is quite limited in the sense that it generally does not allow to compress significantly the data. Lossy compression, however is very useful in many case, where compression the image by a big factor may be worth losing some details. There are many techniques of lossy image compression, from the simplest (basically sampling the image) to the most complicated. The method we are going to present here is of course based on clustering : the idea is to reduce the number of colors used by clustering the pixels in the RGB-space and replace the RGB value of every pixel (x,y) by the RGB value of the centroïd of the cluster where (x,y) belongs. The rationale behind that is that points that neighbors in the (x,y)-plane are likely to have similar colors. Thus, after replacing the similar colors by a single one, which we will do thanks to clustering, the resulting picture will have big area with the same color, allowing to store the information in a mode compact form. Before going deeper into explanations, let us take a look at what the final result looks like: we achieve a compression by almost a factor 10 (49 kB instead of 420 kB) without dramatically decreasing the quality. K-means is particularly suited for clustering points in the RGB-space, because it a is fast (and quite intuitive) method. Moreover, its main deficiency, namely the difficulty to handle interlacing non-convex shapes, is not a problem here. Indeed, our perception of the colors is such that pixels with similar colors form a convex shape in the RGB-space. The figure below shows the clustering (with 10 clusters) of the pixels of the initial image in the RGB-space: We will now develop a little bit more the functionment of the method by showing the different steps on a very dummy and very simple example. First of all, let us introduce the notations: an image is fully represented by the position (x,y) of each pixel and the (R,G,B) components a set of points, i.e. (x,y,R,G,B) \in \{1..n_x\} \times \{1..n_y\} \times \{1..255\}^3 Our dummy example is the following: we have a 9 pixels image ''Im'', containing thus up to 9 different RGB values, and we would like to compress it by only keeping 2 colors, in order to only have 2 different RGB values to store. Im = \left(\begin{array}{lll} (100,200,200) & (110,210,210) & (120,220,220)\\ (10,20,20) & (11,21,21) & (12,22,22)\\ (110,220,220)& (100,200,200)& (120,210,210) \end{array}\right) Obviously, all the pixels of the first row and all the points from the third line are in the same cluster, with centroïd (110,210,210), while all the pixels of the second row belog to the second cluster, with centroid (11,21,21). We then assign to each pixel (x,y) the RGB value of the corresponding centroïd: Im = \left(\begin{array}{lll} (110,210,210)& (110,210,210) & (110,210,210)\\ (11,21,21)& (11,21,21) & (11,21,21)\\ (110,220,220)& (110,210,210)&(110,210,210) \end{array}\right) Which allows now format like ''.png, .jpg'', etc. to store the information in a more compact form, like for example, RGB(x,y)= \left\{\begin{array}{ll} (110,210,210) & \text{ if } y=1 \text{ or } 3\\ (10,20,20) & \text{ else }\\ \end{array}\right. ''Clustering non convex shapes'' A promising application of clustering techniques is the detection and the counting of "similar objects" on a picture. Imagine for example that the YellowCab company in Manhattan is interested to have some statistics on the number an spatial disposition of taxis at strategical places. To achieve this objective, the company has installed camera on the top of some skyscrapers, that periodicallly takes a picture of the road, but they do not want to analyse manually each picture. This situation may seem unrealistic, but why not... First of all, let us clarify the notion of similarity of objects: we understand that two objects are "similar" if (and only if) they have "similar" colors and shapes. Typically, this will be the case of the yellow taxis. The steps of the algoritm are the following : # Cluster the original picture in the RGB-space such that all the similar items have exacty the same color. The choice of the number of clusters is important : taking too few clusters would leed to confuse the color yellow (of the taxis) with other similar colors, like the white road markings. On the other hand, takin too many clusters may lead to separate the yellow color in two sort of yellow (if for example the shadow of the skyscraper makes some cabs appear darker than others). However, yellow is a quite distinguishable color, thus it is preferable to take too much clusters than too few. Typically 3 to 10 clusters works. But remember that the more cluster you want, the slower is k-means. 3,4 or 5 is thus a good choice. It is thus preferable to take to take too much than too few # Only keep the interresting points, i.e. the ones that have the targetted color (yellow in the case of yellow cabs). This can be done easily by measuring the (Euclidian) distance between every color in new image (i.e. every centroïd from step 1) and the RGB of yellow : (255,255,0). Store the position (x,y) of these interresting. # Cluster these points in the (x,y) space with different k, where k is the number of clusters, and minimize some error criterion in order to choose the right k. The clustering can be done with k-means if the objects have a "convex" shape. In fact they don't need to be convex stricto sensu: they can have holes (for example at the location of the windows) as long as each object can be enlarged in a convex shape in which the density of the colored points is "high enough". While not very rigourous, this criterion is typically observed in the case of taxis, or even objects more complex shapes (like planes, as we shall see later). The definition of an error criterion is obvisouly the most difficult part: clearly we have to penalize clusters that are too large (because the are likely to contain several taxis), but also penalize too small clusters, too avoid splitting a taxi in several clusters. Here is another example where we use our method to localize (at a given moment), distinguish and count the number of plane in a squadron. '''Supervised Clustering : a User Friendly Tool''' = '''Hierchical Clustering''' = '''Principle of Hierarchical Clustering''' '''CURE Algorithm''' '''Implementation''' '''Applications''' '''Conclusion''' = =